1. Introduction

Among the variety of emerging medicines, nucleic acids are a novel therapeutic modality with significant potential. Nucleic acid therapies are in development for a wide variety of maladies including cancer, protein replacement, and infectious disease (Figure 1-1).1 Most recently, two mRNA vaccines were approved by the FDA and administered to the broader population as a part of the response to the COVID-19 pandemic in 2020.2 Despite the promise of this new class of medicines, several key challenges remain to full realization of its potential. Compared to other drug subtypes, naked nucleic acid (especially mRNA) is very fragile and degrades in a matter of minutes upon administration. Additionally, nucleic acids cannot cross freely into cells to reach the intracellular machinery required to leverage its therapeutic effect.3

Figure 1-1: Summary of various tissue targets and therapeutic applications of nucleic acid therapeutics. Adopted from Reference 1.

Collectively, these challenges are addressed through the formulation of nucleic acid with other biologically active molecules to circumvent these barriers. Most commonly, the nucleic acid is combined with water insoluble molecules known as lipids to form a lipid nanoparticle (LNP).4 The fabrication process for nucleic acid LNPs is highly intricate and requires precise control over relevant parameters. Specifically of interest is the relative proportion of the lipid components in the LNP as this has been shown to dramatically impact LNP pharmacokinetics and pharmacodynamics across several preclinical studies.5 Additionally, other process parameters such when the nucleic acid is mixed with the lipid and how much nucleic acid is mixed with lipid are of particular interest and should be optimized depending on the therapeutic application.

In this project, data that were collected as part of a formulation optimization study for a novel polymer/lipid hybrid delivery system will be analyzed. The experiment was conducted over the course of a month in July 2021 on site at AstraZeneca in Gaithersburg, MD and consisted of 45 independent formulation ratios of lipid components to span the design space. The formulations were evaluated for the biophysical parameters of size and nucleic acid incorporation efficiency. These formulations were tested for in vitro delivery of two nucleic acid types: mRNA and DNA. Owing to the tissue specificity imparted on LNP therapeutics derived from the relative ratios of formulation components, these samples were tested on a variety of cell types including muscle, liver, lung, and immune. Two mRNA incorporation strategies were utilized to localize the nucleic acid cargo in the center or the surface of the particle. Additionally, 3 nucleic acid/lipid ratios were tested on each cell type. Cells of each tissue type of interest were incubated with LNPs carrying nucleic acids encoding fluorescent protein. The fluorescent signal from this protein was tracked with live cell imaging every 6 hours for 72 hours. Using this experiment data a series of research questions will be answered pertaining to the independent variable impact on transfection response. Additionally, the relationship between the dependent variables will be evaluated. Continued analysis of formulation process parameters will allow for more targeted and effective nucleic acid therapeutics.

Research Questions:

There are 4 research questions that this analysis will address:

1. Is there an optimal time point and dosing for sampling that would capture the relative difference in the treatment?

Determination of an optimal time point and dosing could be useful in reducing the amount of data that needs to be collected for each transfection, thus minimizing long term use of computer storage space for large image files collected from each scan. The optimal time point and dose will be determined primarily with exploratory data analysis and multivariate linear regression modeling.

2. How do the individual variables of cell type, nucleic acid type, and incorporation method impact the transfection reading?

The binary categorical variables of nucleic acid incorporation strategy (encapsulation or adsorption) and type (DNA and mRNA) will be evaluated in the context of cell type to determine the optimal combination. This will be accomplished with ANOVA test of the transfection response with respect to these explanatory variables.

3. How do individual formulation components impact the response in transfection?

Relative percentages of the lipids in the nucleic acid formulation can have significant impact on performance, especially with respect to cell type. The impact of each of these formulation components will be assessed with Scheffe polynomial modeling of the transfection response with respect to the formulation components.

4. Can size/incorporation efficiency predict transfection performance?

It would be useful and resource saving if the transfection performance could be predicted by the biophysical properties of the formulation collected upstream of a transfection experiment. To determine if this is possible, a multivariate linear regression model will be constructed for average formulation transfection response with respect to measured size and nucleic acid incorporation efficiency.

2. Data Description

Description of Initial Data Files

The data for this analysis comes from a series of in vitro experiments aimed at characterizing the biophysical properties and transfection efficacy of 45 unique formulations for mRNA and DNA delivery. The formulations consist of 5 components (DOTAP, MC3, DMG-PEG, DSPC, and cholesterol) and a formulation ID was assigned to each unique combination of these components. The formulation levels were computed as a D-Optimal design in the statistical software JMP.6 The molar percentage of each component is given along with the corresponding formulation ID in the file Formulation_Components.csv.

The formulations were prepared in a 96 well plate format by a Hamilton vantage and is given in Figure 2-1.

Figure 2-1: The 96 well layout for synthesis of the formulations. The raw data for size and incorporation efficiency refers to the well, not the formulation ID. The mapping of well to formulation id is given as the number in each circle and the incorporation type is given by the letter (E = encapsulated and A = adsorbed).

The size of the nanoparticles in the formulations was assessed by DLS and was directly stamped onto a DLS plate and the output file was a csv of an 8x12 table where each entry corresponds to the size of the formulation in that well (see size_mrna.csv for an example). The size was evaluated separately for mRNA and DNA.

The incorporation efficiency was evaluated through fluorophore labeled nucleic acid retention analysis where the particles were mixed with fluorescently labeled nucleic acid, separated from the solution, and the fluorescent content of the solution was evaluated. The fluorescent signal was collected in a similar manner to the size data in a 96 well plate format (see incorporation_mrna_raw.csv for an example). In order to correlate these raw fluorescence signals to nucleic acid amounts, a standard curve of fluorescence signals vs. known nucleic acid concentrations was measured and loaded as separate csv files (see incorporation_mrna_standard.csv for an example).

The transfections were completed on individual 384-well plates of each individual cell type- lung, liver, immune, and muscle. Each plate was treated with all 45 formulations made with one nucleic acid type (mRNA or DNA). The transfections were set up in duplicate (2 plates per condition). The formulations were stamped onto the cells in a quadrant format where each formulation was added to the cells at a high, medium, or low dose based on the position in a 2x2 square as shown in Figure 2-2.

Figure 2-2: Quadrant mapping strategy for automated formulation plating. Each 2 x 2 block of wells on the 384 well plate was stamped with the high dose of the formulation in the top left corner, the medium dose in the bottom left corner, and the low dose in the top right corner.

The plates were then loaded into an IncuCyte fluorescent microscope and a series of fluorescent microscopy images of each plate were collected every 6 hours for 72 hours. The images were quantified for total fluorescent signal to result in a transfection efficiency response for each well at each timepoint. The data for each plate were then exported into csv files which consisted of 40 x 386 tables for each cell type/nucleic acid type where each observation is a time point and each column (excluding the first two column which contains data about the time points evaluated) corresponding to an individual well in the order (A1, B1, C1…..N24, O24, P24).

Data Wrangling

Given the structure of the raw data as collected from the instruments used to evaluate response variables, substantial data wrangling is required to construct the main data frame that is required for analysis. The following is the description of steps that were performed to prepare the data for cleaning and analysis.

Install/load libraries

Required packages for the analysis were installed and the relevant libraries were loaded into R.

Write a function to be called to parse transfection readings into universal data frame structure

The transfection data is identical for the individual plates, but not conducive to analysis. To address this, a function was written that could be repeatedly called to reorganize the data from each plate into a universal format where each row is an individual transfection reading, and each column is a variable of interest in the experiment (nucleic acid type, incorporation strategy, etc.).

Write a function to parse the size data

Similarly, the size data must be reorganized to match the data structure of the transfection files. As this must be completed for both mRNA and DNA formulations, a function was written to build the appropriate data frames.

Write a function to parse the incorporation efficiency data

The incorporation efficiency data must also be reorganized, similar to the size data. The raw fluorescent readings were also converted to nucleic acid concentrations. This was accomplished by fitting a linear model to the standard curve data and then using this model to calculate the nucleic acid concentration for a given fluorescence reading. The incorporation efficiency was calculated by the following formula: \[ \left(1 - \frac{\text{Well Concentration}}{\text{Mixing Concentration}}\right) \times 100\% \]

Read in data from CSVs

The individual data files were then read into R as data frames.

Parse the transfection data

Each raw data frame from the transfection experiment was then reorganized into an appropriate data structure with the function written above.

Parse the formulation size and incorporation efficiency data

Each raw data frame from the size and incorporation efficiency experiments were then reorganized into the appropriate data structure with the associated functions written above.

Bind and join the individual parsed data frames into one data frame.

The parsed transfection data frames were bound by rows into one large transfection data frame. A similar operation was performed on the mRNA/DNA size and incorporation efficiency data. The transfection, size, incorporation efficiency, and formulation component data frames were then inner joined to produce the final wrangled data frame. Note that some observations in the transfection data frame were lost because the wells in the plates corresponding to formulation IDs 46, 47, and 48 did not receive treatment and did not have matching formulation component data. The individual columns were reorganized to facilitate data frame inspection.

Mutate the transfection signal related variables to a logarithmic scale

For transfection results, the values usually have a large range that does not conform to a normal distribution. This was confirmed by analyzing the summary statistics of the tfx_response variable (min = 0, max = 16500000, median = 2848, mean = 193840). To facilitate downstream data analysis, the tfx_response was transformed onto a logarithmic scale. Some tfx_response observations were reported as less than 1 and these were mutated to a value of 1 to enable a logarithmic transformation.

Calculate the signal to noise ratio for each transfection response observation

The signal to noise ratio for each observation was defined as the ratio of the transfection response to the initial transfection response (t = 0) of each transfection time series. It was computed by first generating a table of initial values and then joining it to the all_data data frame. The signal to noise ratio was the computed as tfx_response/initial_response.

Determine if the signal is quantifiable and save it to a new categorical variable

The transfection response was defined as quantifiable if the signal to noise ratio for the observation was greater than 1 (meaning the signal is 10 times the noise at the initial time point).

Determine if the transfection conditions produced a responder

A responder is defined as a well with an average transfection intensity signal over the time span of 24-72 hours as greater than 3 times the noise (as defined by the initial transfection response value). Save the condition of whether or not a transfection is a responder to a new variable.

Data Cleaning

Recode categorical variables as factors

The categorical variables of formulation ID, cell type, plate number, nucleic acid, incorporation strategy, and dose were recoded as factors to facilitate analysis.

Remove any NA values from the response variables

The observations with NA for any of the response variables of size, efficiency, and tfx_response were inspected to determine if imputation is appropriate. It was determined that only the tfx_response variable had NA values and these all corresponded to the lung cell type for the DNA nucleic acid on plate 2. Further inspection revealed that only one time point was recorded for this plate- the 72 hour time point. This may be due to an error in the machine reading the samples or an error in exporting the data. As a result, imputation was deemed inappropriate (a single, entire category was missing) and the NA values for tfx_response were simply removed.

Inspect the final result of the data import, wrangling, and cleaning.

The data frame containing all data for the experiment was then inspected to confirm successful import, wrangling, and cleaning. The full data frame was exported as a file named all_data.csv. A codebook for the all_data.csv is provided as a word document in the project folder.

3. Exploratory Data Analysis (EDA)

Optimal timepoint and dosing

The first research question was answered primarily with exploratory data analysis/visualization followed by multivariable regression. In order to determine the optimal timepoint and dose for analysis, first the wells that did not respond to transfection were eliminated. For the purposes of this analysis, this was taken as any transfection observation that maintained an average signal to noise ratio of 10 between 24 and 72 hours. The noise value was taken as the transfection reading for timepoint 0.

The summary statistics of the transfection signal response aggregated with respect to dose are given in Table 3-1.

Table 3-1: Summary statistics for the responders in the sample data set. The count is the count of individual transfection wells that met the criteria for a responder over the time period studied.

Dose Count Min Max Average SD SE CV
High 8020 0 7.21 4.29 1.24 0.01 0.29
Low 5600 0 6.94 3.85 1.12 0.01 0.29
Medium 7005 0 7.22 4.13 1.21 0.01 0.29

As shown, the count of individual transfection samples that responded was highest in the highest dose. Additionally, the average transfection response signal is highest in this group as well. The relationship between dose and responder count is given in Figure 3-1. The graph illustrates that the responder count was the highest in the highest dose.

Figure 3-1: Bar chart of distribution of responders vs. the three doses. The definition of a responder is given in the data wrangling section.

The average transfection signal was also evaluated for responders with respect to time and is given in Figure 3-2. As expected, the transfection response for the high dose was the highest for all time point considered during the transfection. The response also appears to level out after 24-36 hours.

Figure 3-2: Line graph of average transfection response over time. The gray shaded area surrounding the curve represents the 95% confidence interval of a model fitted using the loess method.

The percent quantifiable for all observations was also plotted vs. the dose over the time period of the experiment and is given in Figure 3-3. As shown, the percent quantifiable is highest in the high dose and increases over time. Interestingly the percent quantifiable appears to continue increasing at the end of the experiment, suggesting additional time points may be valuable.

Figure 3-3: Percent of observations that are considered quantifiable over time as defined in the data wrangling section.

Impact of incorporation, nucleic acid, and cell type on transfection response

In order to evaluate the impact of the three categorical variables incorporation strategy, nucleic acid on the average transfection response, a table of aggregated summary statistics was prepared in Table 3-2. As shown in the table there are categories of data that are missing (no DNA samples for immune or muscle cells). This is due to the fact the original experiment could not be completed in the time frame available. According to Table 3-2 there are high counts and therefore low standard errors for each group. This bodes well for the intended ANOVA evaluation of the groups.

Table 3-2: Summary statistics for the average transfection response with respect to the categorical experiment variables of incorporation, nucleic acid, and cell type.

Cell Type Incorporation Nucleic Acid Type Average SD Count SE Min Max
Immune Adsorbed mRNA 3.44 0.65 2496 0.01 0.00 5.13
Immune Encapsulated mRNA 3.35 0.62 2522 0.01 0.58 4.86
Liver Adsorbed DNA 3.56 1.02 1690 0.02 0.00 5.58
Liver Adsorbed mRNA 4.09 0.83 1859 0.02 1.25 5.57
Liver Encapsulated DNA 3.50 1.02 1690 0.02 0.28 5.59
Liver Encapsulated mRNA 3.81 0.93 1092 0.03 0.49 5.26
Lung Adsorbed DNA 5.33 1.33 1250 0.04 0.00 7.22
Lung Adsorbed mRNA 4.83 1.38 2639 0.03 0.00 6.84
Lung Encapsulated DNA 5.02 1.22 1175 0.04 0.00 7.14
Lung Encapsulated mRNA 4.62 1.16 2184 0.02 0.00 6.64
Muscle Adsorbed mRNA 4.07 0.76 806 0.03 2.21 5.83
Muscle Encapsulated mRNA 4.46 0.98 1222 0.03 2.39 6.43

In order to visualize the impact of each category on the aggregated sample statistics, a chord diagram was constructed for the average transfection response and the number of responders for each group in Figure 3-4. Examination of the circle perimeter reveals the lung exhibited more transfection than the liver, but it is unclear compared to other cell types due to missing DNA data. Furthermore, the average transfection response appears to be similar overall for encapsulated and adsorbed samples. With respect to nucleic acid, it appears the mRNA and DNA performed similarly for the lung and liver. In the responder count chord diagram, it appears there are a similar amount of responders in the encapsulated and absorbed groups. Additionally, it appears the mRNA led to a greater responder count.

Figure 3-4: Chord diagrams for the average transfection response (top) and responder count (bottom) for each group. Blue represents DNA and red represents mRNA.

A box plot was prepared to visualize the transfection response distribution across all groups and is given in Figure 3-5. This plot corroborates the findings of the chord diagram that there is a smaller difference between adsorbed and encapsulated formulations. Additionally, there may be significant outliers in the data for the immune and muscle cell types. Without a specific reason to remove these, they were left in for analysis.

Figure 3-5: Box plot for transfection response vs category.

Finally, a histogram was prepared for each group with respect to average transfection efficiency and is given in Figure 3-6 for mRNA and Figure 3-7 for DNA. As shown, many of the distributions are not normal. However, the count was determined to be very large for each group meaning normality of the mean can be assumed for each group by the central limit theorem.

Figure 3-6: Histograms of the mRNA transfection responses with respect to the incorporation and cell type variables.

Figure 3-7: Histogram of DNA transfection responses with respect to the incorporation and cell type variables.

Impact of formulation components in transfection

In order to visualize the impact of the formulation components as scatter plot overlaid with a simple linear model was created for the average transfection response with respect to each cell type and formulation component. It is given in Figure 3.8. As shown, there appears to be some slight variation in the transfection response data with respect to MC3 and DOTAP. However, due to the constrained nature of the design space, it is not possible to determine with this simple data visualization in the formulation component has a meaningful impact on performance.

Figure 3-8: Scatter plots with linear model overlay of average transfection response with respect to percent levels of each formulation component with respect to each cell type studied.

Predicting transfection response with size and incorporation efficiency

In order to visualize the effect of size and incorporation efficiency on transfection response, a 3D scatter plot was prepared which can be interacted with in the report generated from the R Markdown in Figure 3-9. As shown, the highest values appear to be aggregated in areas with high incorporation efficiency.

Figure 3-9: Interactive 3D scatterplot of average formulation transfection response with respect to its measured size and efficiency.

To more clearly visualize potential trends in the transfection data with respect to each individual variable, a 2D scatter plot was prepared for the average transfection response with respect to each individual variable as given in Figure 3-10. This graph similarly demonstrates a potential correlation of incorporation efficiency to transfection response, but unlikely with respect to size.

Figure 3-10: 2D scatter plots of transfection response with respect to each independent variable of size and transfection efficiency.

4. Statistical Inference and Regression Modeling

The significance level for all tests performed in the following sections is given as α = 0.05.

Optimal timepoint and dosing

In order to evaluate the optimal dose and time point, multivariable regression modeling was performed for the transfection response and logistic regression was performed with the percent quantifiable response. To facilitate linear modeling of the response with respect to time, only time points where the transfection response/percent quantification appeared to level off in exploratory data analysis were used (42-60 hrs). The equation for the transfection response is given below where R-hat is the transfection response, T is the time, and D is the dose.

\[ \hat{R} = \beta_0 + \beta_1 T + \beta_2 D + \beta_3TD + \epsilon \]

The model generated confirmed a significant intercept, but no significant parameters for coefficient of the time, dose, or interaction between the two. Additionally, some of the assumptions were determined to be severely violated including normality of residuals (p-value of KS test was extremely small and less than the significance level) as well as the assumption of no multicolinearity (all VIFs were greater than 5). This suggests the linear model for the domain is not valid. The ANOVA determined that the effect of the dose was significant, despite the fact the model coefficient was not. Reviewing some of the data visualizations confirmed that over the domain considered, there are minimal changes in the dependent variable of transfection response with respect to dose and time.

The equation for the percent quantifiable was similar except the dependent variable was the logit of the proportion of quantifiable data. The equation is given below where p-hatq is the proportion of quantifiable observations, T is the time, and D is the dose.

\[ \ln\left(\frac{\hat{p_q}}{1 - \hat{p_q}}\right) = \beta_0 + \beta_1 T + \beta_2 D + \beta_3TD + \epsilon \]

A similar result was found with the logistic regression model for proportion of quantifiable observations with respect to time and dose. None of the parameters in the model except the intercept were deemed significant and there were severe violations of the normality of residuals as well as the absence of multicolinearity. Data visualization previously revealed that there is minimal change in these response parameters with respect to time and dose, thus explaining the poor modeling results.

Given the inability to predict the optimal of time and dose over the domain of interest, the dose will be taken as high and a time range of 42-60 hrs. will be recommended as the time frame to attain the maximal transfection response and percent quantifiable.

Impact of incorporation, nucleic acid, and cell type on transfection response

To understand the nature of the impact of the categorical variables of incorporation, cell type, and nucleic acid type on the transfection response, ANOVA was utilized. Although 3 categorical variables would imply the use of three-way ANOVA for the entire data set, some of the groups are missing data as the experiment was never fully completed. The lung and liver cells were missing data pertaining to DNA formulations. The problem was addressed by running both a two-way and three-way ANOVA. The two-way ANOVA was restricted to the nucleic acid type of mRNA with the explanatory variables of incorporation and cell type. A full three-way ANOVA was also performed restricting the cell types to liver and lung, with three explanatory variables of cell type, incorporation, and nucleic acid.

Two-Way ANOVA for Cell Type and Incorporation Strategy

The hypotheses for the two-way ANOVA are given below:

Main Effect of Cell Type

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups in the set of cell types (Liver, Lung, Muscle, and Immune)

The null hypothesis states that the cell type has no effect on the transfection response with the means of all groups being equal. The alternative hypothesis states that at least one cell type had an average transfection response different from the others.

Main Effect of Incorporation Strategy

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups in the set of (Adsorbed and Encapsulated) incorporation strategies

The null hypothesis states that the incorporation strategy has no effect on the transfection response with the means of all groups being equal. The alternative hypothesis states that at least one incorporation strategy had an average transfection response different from the others.

Interaction of Cell Type and Incorporation Strategy

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups within the unique set of combinations of the incorporation and cell type variables

The null hypothesis states that each unique group within the combinations of the values from the cell type and incorporation variables has an identical transfection response reading. The alternative states that the mean transfection reading from at least one of these groups is different from the others.

The ANOVA table for the two-way ANOVA is given in Table 4-1. As shown, the p-values for all main effects and interactions are less than the significance value. Therefore, all null hypotheses are rejected in favor of the alternatives that there is both a main effect and interaction effect of the two variables.

Table 4-1: The ANOVA result for the two-way ANOVA of transfection response with respect to cell type and incorporation strategy (restricted to mRNA transfections only).

Df Sum Sq Mean Sq F-Value p-value
Cell Type 3 4561.75113 1520.5837115 1620.70035 0
Incorporation 1 38.94661 38.9466146 41.51090 0
Cell Type:Incorporation 3 156.39814 52.1327134 55.56518 0
Residuals 14812 13897.00812 0.9382263 NA NA

The assumptions of the two-way ANOVA were also evaluated. With respect to independence of groups, it can be assumed from a exploratory standpoint that all samples are independent. However, other dependencies may occur due to the result of experimental design that could affect the rigor of conclusions for lead drug candidates in downstream development. The best way to address this potential issue is improved documentation in the upstream experiments with respect to reagents and cell lines used. The normality assumption was tested with the Kolmogorov-Smirnov test and was found to be severely violated with a p-value less than the significance level for this study. However, the counts are very high in each group (n > 500), meaning that normality of the mean can be assumed by the central limit theorem. The assumption of homoscedasticity was tested with Levene’s test and was also found to be severely violated. To address this, ANOVA was rerun utilizing Wilcox’s Robust ANOVA which can be used in situations where normality and homoscedasticity are violated. The conclusions of this more robust test were identical to the standard ANOVA. Additionally, the p-values for the impact of each variable are extremely low. This suggests even if some degrees of freedom were lost due to violation of assumptions, the p-value would still be lower than the significance level.

Tukey’s post-test was also run on the ANOVA to determine which groups were statistically significant. The test found that all individual comparisons of groups were significant with respect to the main effect variables. Additionally, most interaction comparisons were significant.

Three-Way ANOVA for Nucleic Acid, Cell Type, and Incorporation

The hypotheses for the three-way ANOVA are summarized below:

For each main effect variable of incorporation, cell type (only liver and lung groups), and nucleic acid a set of main effect hypotheses were considered.

Main Effect Hypothesis

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups from the set of values for the main effect variable examined.

The null hypothesis states that the variable has no effect on the transfection response with the means of all groups being equal. The alternative hypothesis states that at least one group in the variable considered had an average transfection response different from the others.

Two-way interaction effects were also considered in the ANOVA with each unique combination (Incorporation:Nucleic Acid, Incorporation:Cell Type, Nucleic Acid:Cell Type) having a set of null and alternative hypotheses.

Two-Way Interaction Effect

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups in the set of the unique combinations of the values of the two variables considered

The null hypothesis states that each unique group within the combinations of the values from the two variables has an identical transfection response reading. The alternative states that the mean transfection reading from at least one of these groups is different from the others.

The three-way interaction was also considered in this ANOVA with the set of hypotheses given below:

Three-Way Interaction Effect

Ho: μi = μj for all unique combinations of i, j
Ha: μi ≠ μj for at least one unique combination of i, j

where i and j are both unique groups in the set of the unique combinations of the values of the three variables considered

The null hypothesis states that each unique group within the combinations of the values from the cell type, nucleic acid, and incorporation variables has an identical average transfection response reading. The alternative states that the mean transfection reading from at least one of these groups is different from the others.

The ANOVA table for the three-way ANOVA is given in Table 4-2. As shown, the p-values for all main effects and interactions are less than the significance value except for the interaction of nucleic acid and incorporation (p = 0.058). Therefore, most null hypotheses are rejected in favor of the alternatives except for the interaction of nucleic acid and incorporation where it is plausible the null hypothesis is true.

Table 4-2: The ANOVA result for the three-way ANOVA of transfection response with respect to nucleic acid, cell type, and incorporation strategy (restricted to liver and lung cell types only).

Df Sum Sq Mean Sq F-Value p-value
Nucleic Acid 1 176.649616 176.649616 135.075548 0.0000000
Cell Type 1 4231.780471 4231.780471 3235.840980 0.0000000
Incorporation 1 170.407625 170.407625 130.302595 0.0000000
Nucleic Acid:Cell Type 1 618.985279 618.985279 473.308563 0.0000000
Nucleic Acid:Incorporation 1 4.693005 4.693005 3.588518 0.0582014
Cell Type:Incorporation 1 4.236713 4.236713 3.239613 0.0718998
Nucleic Acid:Cell Type:Incorporation 1 20.595841 20.595841 15.748658 0.0000727
Residuals 13571 17747.934194 1.307784 NA NA

Similar to the two-way ANOVA, the assumptions were considered for the three-way ANOVA. The independence assumption can be validated with the same caveats due to the fact the data is from the same study. The normality was similarly violated by the Kolmogorov-Smirnov test but analysis of counts in each group permits reliance on the central limit theorem (n > 1000). Levene’s test also revealed that homoscedasticity cannot be validated, so the test was rerun utilizing the robust Wilcox method. The test results were identical. However, the robust test indicated that the p-value for the effect of nucleic acid of the transfection response is barely significant (p = 0.042) meaning that degree of freedom loss may compromise the conclusion that nucleic acid has a significant effect on the average transfection reading.

Tukey’s post-test was similarly run to compare all groups to all groups in the ANOVA. With respect to the main effects, all group comparisons were significant. With respect to two-way interactions, nearly all group comparisons were significant except the comparison of the encapsulated mRNA group to the adsorbed DNA group. All three-way interactions were deemed significant as well except the comparison of the encapsulated DNA formulations on liver cells to the adsorbed DNA formulations on liver cells. Additionally, the p-values for all significant comparisons were diminutive (p < 0.001), indicating strong conclusions that may offset lack of robustness due to assumption violations.

Impact of formulation components in transfection

To understand how individual components impact transfection response a Scheffe linear model was constructed. The Scheffe linear model is a special form of linear regression where each dependent variable is the proportion of an individual formulation component in the overall mixture.7 It accounts for the inherent perfect multicolinearity between the five formulation components as they are required to sum to 1 as well as constraints on the proportions of each component. All models were constructed with values from the high dose and the observations made between the 42-60 hours. Only responders and quantifiable data were considered. For each cell type, an optimal group was selected based on the groups that gave the maximum average transfection response variable and are given in Table 4-3.

Table 4-3: Summary of groups used for modeling the formulation component response.

Cell Type Nucleic Acid Incorporation
Lung DNA Adsorbed
Liver mRNA Adsorbed
Muscle mRNA Encapsulated
Immune mRNA Adsorbed

All models were the Scheffe linear given by the equation below where each xi is the proportion of one of the individual components.

\[ \hat{y} = \sum_{i=1}^{q} \beta_i x_i + \epsilon \]

The model was fit and all model coefficients were found to be significant for each group except the PEG component in the lung and liver models. The adjusted R2 values for each model were above 0.97. The model coefficients cannot be interpreted as a normal linear model due to the domain constraints imposed by the experiment. The most valid interpretation of each coefficient is that it is the expected response when the associated formulation component is 100% of the mixture and all other formulation components are zero. This condition is not in the constrained domain space and is likely to be an invalid conclusion due to the unique role of each component in the transfection process.

The assumptions of independence, normality, and homoscedasticity were also considered for these models. The assumption of independence is considered satisfied with the same caveats given to the ANOVA tests ran on the category data. The assumptions of normality and homoscedasticity were visualized with plots of the model residuals in Figure 4-1. To assume normality and homoscedasticity, the residuals should be randomly distributed across the x-axis with no apparent pattern as the index increases. In this case there is no major deviation from these conditions, however there are a few problematic areas on some plots. For example, the plot of lung residuals has a region in the index of 0-30 where the residuals are bunched together.

Figure 4-1: Plots of residuals for the four cell type models.

The assumptions of normality were assessed with the Shapiro-Wilk test on the residuals and the Breusch-Pagan test for homoscedasticity in linear models. The results of all the tests are presented in Table 4-4. The lung model was the most problematic with rejections of the null hypotheses of normality and homoscedasticity. However, the p-values for significant parameters are highly significant (p < 0.001) along with a strong correlation (R2 = 0.98) indicating the model findings may be robust enough to overcome violations of these assumptions.

Table 4-4: Summary of assumption tests for cell type Scheffe models. A p-value of less than 0.05 means that the associated assumption is rejected.

Cell Type Normality P-Value Homoscedasticity P-Value
Lung 0.000 0.006
Liver 0.237 0.260
Muscle 0.639 0.133
Immune 0.124 0.937

Predicting transfection response with size and incorporation efficiency

A multivariate linear regression model was built to determine if the transfection response could be predicted by the size and efficiency. The formula for the model is given below where R is the transfection response, S is the size, and E is the incorporation efficiency.

\[ \hat{R} = \beta_0 + \beta_1 S + \beta_2 E + \epsilon \] The results of the regression analysis determined the β2 parameter for efficiency was significant as well as the β0 parameter for the intercept. A follow up ANOVA analysis confirmed this result. The adjusted R2 for the model was 0.2616, indicating overall a poor fit. The correlation coefficient of the response with respect to the efficiency was 0.543, indicating a modest ability to predict transfection response with respect to this variable.

The assumptions were assessed in the context of a two variable linear regression model. The same assumption of independence was accepted as valid with the same caveats noted in previous sections. The normality assumption was evaluated with a Shapiro-Wilk test and was found to be violated. Homoscedasticity in the model was verified with the Breusch-Pagan test. Despite the violation of normality, the test conclusions are likely valid due to the high significant nature of the results (p < 0.001). The assumption of no multicolinearity was verified through VIF calculations. The VIF for size and efficiency was 1.03, indicating minimal multicolinearity.

5. Data Visualization and Results

Optimal timepoint and dosing

As denoted in section 4, the regression model for both transfection response and percent quantifiable were not significant with respect to either variable. The model is given overlaid with the aggregate average data Figure 5-1. As confirmed by observation of the points on both graphs, there is minimal variation in the transfection response with respect to time and dose. With respect to time, this is not surprising as the time points were specifically selected to demonstrate the leveling off effect. However, the dose was determined to be a significant factor for the average transfection response in the related ANOVA. This is supported by the small increase in the transfection response with respect to the dose on the transfection response plot. Visual analysis of the percent quantifiable with respect to the dose and time does not yield any obvious evidence of trends. Therefore, it is recommended that the dose be set to high for future experiments, and any time point on the range of 42-60 hours can be selected. Ideally the dose should be evaluated with respect to cell viability. This could not be completed due to time constraints on the overall experiment. Given the minimal variation in response variables for these models, it may be useful for future analyses to use a different response variable. One possibility is a variable akin to the cycle threshold in PCR. Instead of using an average signal, the time at which the signal crosses a threshold is used as the response. This may tease out the more nuanced differences in transfection performance that would be apparent in a more relevant biological system (i.e. an animal model).

Figure 5-1: Interactive 3D scatterplots of the response vs. time and dose (blue points) overlaid with the multivariate regression model (red plane). Two responses were considered (top) the average transfection signal and (bottom) the percent quantifiable.

Impact of incorporation, nucleic acid, and cell type on transfection response

The impact of incorporation, nucleic acid, and cell type were determined with individual ANOVAs. The ANOVAs found the effects of the main variables were significant as well as most of the interaction terms. Post hoc tests found that all comparisons of main effect variable groups were significant, thus permitting for simple visual interpretation of the bar chart representing the average transfection response with respect to these variables in Figure 5-2. It is clear the lung cells had the strongest response and immune cells had the lowest response. A comparison with respect to nucleic acid is not easily made as data for DNA is missing. However, it appears mRNA is more effective in liver cells than DNA and the reverse is true for lung cells. Generally, the adsorbed samples appear to have slightly outperformed the encapsulated samples, however the difference is minor. As a result of these findings, the groups with the highest average transfection response were used for each to model the impact of formulation components in each cell type as summarized in Table 4-3.

Figure 5-2: Grouped bar chart of the average transfection response with respect to the variables of incorporation strategy, cell type, and nucleic acid. All pairwise comparisons of groups within a specific variable are significant by ANOVA with Tukey’s post-test.

Impact of formulation components in transfection

For final evaluation of the effects of formulation components, effects plots were constructed for each cell model. Owing in part to the unique nature of the constrained design space (all components must sum to 1), normal interpretation of linear regression with respect to individual model terms is not appropriate. Instead, an effect plot shows the response associated with varying the formulation component in question along a line from the centroid of the design space to a hypothetical vertex where the formulation is 100% of the component considered. This automatically updates the values of the other constrained model variables and allows for determination of the component impact on formulation performance. If the predicted response increases with increasing positive distance from the centroid, the component is considered to enhance the formulation. If the response decreases as the positive distance from the centroid increases, it is considered to be detrimental to the formulation.

The effects plots for the four models constructed are given in Figure 5-3. For the lung, DOTAP appears to impart a positive impact on the formulation performance which is an interesting observation as it is in line with previous reports demonstrating permanently cationic lipids like these result in biasing the formulation to the lung tissue following systemic administration.5 PEG and DSPC were found to decrease performance. For the liver, DSPC was found to increase transfection performance with the other components leading to slight decreases. This contradicts the literature surrounding in vivo administration of these particles as cholesterol is known to enhance liver targeting.1 Muscle cell response increased with increasing DSPC and PEG, but decreased with cholesterol. Immune cells performance increased with PEG/DOTAP lipids and decreased with others. It is not fully understood how these lipids impact targeting these cell types in vivo.

Figure 5-3: Effects plots for the Scheffe models built upon data for lung, liver, immune, and muscle cells. DO = DOTAP, MC3 = MC3, PEG = DMG-PEG, Chol = Cholesterol, DS = DSPC.

Predicting transfection response with size and incorporation efficiency

The dependence of transfection response on size and efficiency was also visualized with an interactive 3D scatterplot Figure 5-4. As expected from the model results, there was a significant positive impact of incorporation efficiency on the transfection response as evidenced by the significant response of the regression plane on this axis. The size appeared to have no effect as there was virtually no change in the response plane in this direction. Although a linear model was not sufficient to predict transfection response from efficiency response alone, it may be combined with other significant variables or fit to smaller subsets of the data. Qualitative analysis revealed that the incorporation efficiency must be high in order to have a strong response. However, it is not true that high incorporation efficiency guarantees a strong response as many low responders had higher incorporation efficiency. This is a valuable finding as prospective formulation candidates can be eliminated earlier in the pipeline if they are found to have low incorporation efficiency.

Figure 5-4: Interactive 3D scatterplot of average transfection response with respect to size and incorporation efficiency (points) overlaid with a multivariate linear regression model plane (red).

6. Conclusion

This experiment and subsequent analysis were highly insightful in evaluation of the various process and drug properties on the capability of lipid/polymer nanoparticles to deliver nucleic acids to a variety of cell types in vitro. It was determined from a transfection parameter standpoint that the highest dose leads to the largest transfection response and percent of observations that are reliably quantifiable. The transfection response was found to level off over time and there was no difference noted in the range of 42-60 hours. Therefore, any time point in this range could be used for an average transfection response, and minimization of cell scanning could reduce hard drive storage requirements on the IncuCyte for future experiments. It was determined in subsequent analysis that a parameter akin to a cycle threshold may be a better parameter to evaluate cell response to the formulation as the variation in average transfection response across most groups was minimal, most likely due the saturation effect.

With respect to categorical process and drug properties, it was found that the incorporation style, nucleic acid type, and cell type all had a meaningful impact on cell response. Overall trends demonstrated the lung cell had the highest response to the formulations on average and the adsorbed slightly outperformed the encapsulated. The information was used to pick one set of formulation parameters to model the impact of formulation components on the transfection response. Analysis of the effects plots generated from Scheffe models resulted in some formulation component impacts in line with in vivo models and some that are not. Early predictions of these correlations could permit for more intelligent selection of formulations to evaluate in vivo as mouse experiments are significantly more involved than cell transfection experiments. Finally, the average transfection response of each formulation was considered with respect to the formulation’s size and incorporation efficiency. It was found the efficiency alone was modestly correlated with transfection response. Size has been determined to be an important variable for in vivo performance, thus illustrating the limitations of this in vitro test.8

Future studies should focus on a further constrained formulation design space based on the ranges of formulation components that were determined to lead to high percentages of responders and quantifiable data. The trends elucidated in vitro should also be tested in vivo for this system as this is a unique combination of polymers and lipids, differing significantly from common formulations for nucleic acid delivery. Additional experiments could also focus on more relevant in vitro models such as those that involve 3D spheroid cultures.9 Continued experimentation on the impact of formulation components and process parameters on nucleic acid therapeutic efficacy can fully unlock the true potential of this novel drug class.

7. References

  1. Meyer, RA, et. al. “Targeting strategies for mRNA delivery.” Materials Today Advances. 14, pp. 100240. 2022. DOI: 10.1016/j.mtadv.2022.100240.

  2. Polack, FP, et. al. “Safety and Efficacy of the BNT162b2 mRNA Covid-19 Vaccine.” The New England Journal of Medicine. 383(27), pp. 2603-2615. 2020. DOI: 10.1056/NEJMoa2034577.

  3. Patel, S, et. al. “Boosting Intracellular Delivery of Lipid Nanoparticle-Encapsulated mRNA.” Nano Letters. 17(9), pp. 5711-5718. 2017. DOI: 10.1021/acs.nanolett.7b02664.

  4. Guan, S. et. al. “Nanotechnologies in delivery of mRNA therapeutics using nonviral vector-based delivery systems.” Gene Therapy. 24(133). 2017. DOI: 10.1038/gt.2017.5.

  5. Cheng, Q. et. al. “Selective organ targeting (SORT) nanoparticles for tissue-specific mRNA delivery and CRISPR-Cas gene editing.” Nature Nanotechnology. 15(4). 2020. DOI: 10.1038/s41565-020-0669-6.

  6. Snee, Ronald, and Roger Hoerl. Strategies for Formulations Development: A Step-by-Step Guide Using JMP. SAS Institute Inc., 2016.

  7. Oguaghamba, OA, et. al. “Modified and generalized full cubic polynomial response surface methodology in engineering mixture design.” Nigerian Journal of Technology. 38(1), pp. 52-59. 2019. DOI: 10.4314/njt.v38i1.8.

  8. Hassett, KJ et. al. “Impact of lipid nanoparticle size on mRNA vaccine immunogenicity.” The Journal of Controlled Release. 10(335), pp. 237-246. 2021. DOI: 10.1016/j.jconrel.2021.05.021.

  9. Vigil, TN et. al. “Expediting in vitro characterization of mRNA-based gene therapies via high-content fluorescent imaging.” Analytical Biochemistry. 627, pp. 114259. 2021. DOI: 10.1016/j.ab.2021.114259.